Distribution of the number of significant effect sizes

In a study reporting multiple outcomes

Categories: effect size, distribution theory

Author: James E. Pustejovsky

Published: March 28, 2024

A while back, I posted the outline of a problem about the number of significant effect size estimates in a study that reports multiple outcomes. This problem interests me because it connects to the issue of selective reporting of study results, which creates problems for meta-analysis. Here, I’ll re-state the problem in slightly more general terms and then make some notes about what’s going on.

Consider a study that assesses some effect size across $m$ different outcomes. (We'll be thinking about one study at a time here, so no need to index the study as we would in a meta-analysis problem.) Let $T_i$ denote the effect size estimate for outcome $i$, let $V_i$ denote the sampling variance of the effect size estimate for outcome $i$, and let $\theta_i$ denote the true effect size parameter corresponding to outcome $i$. Assume that the study outcomes $\left[T_i\right]_{i=1}^m$ follow a correlated-and-hierarchical effects model, in which
$$
T_i = \mu + u + v_i + e_i,
$$
where the study-level error $u \sim N(0, \tau^2)$, the effect-specific errors $v_i \stackrel{iid}{\sim} N(0, \omega^2)$, and the vector of sampling errors $\left[e_i\right]_{i=1}^m$ is multivariate normal with mean $\mathbf{0}$, known variances $\text{Var}(e_i) = \sigma^2$, and compound symmetric correlation structure $\text{cor}(e_h, e_i) = \rho$.
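To make the data-generating model concrete, here is a minimal simulation sketch in Python (the function name `simulate_study` and the use of numpy are my own; the arguments mirror the notation above):

```python
import numpy as np

def simulate_study(m, mu, tau, omega, sigma, rho, rng):
    """Draw effect size estimates T_1, ..., T_m from the
    correlated-and-hierarchical effects model."""
    u = rng.normal(0.0, tau)            # study-level error u ~ N(0, tau^2)
    v = rng.normal(0.0, omega, size=m)  # effect-specific errors v_i ~ N(0, omega^2)
    # sampling errors with compound symmetric covariance:
    # Var(e_i) = sigma^2, cor(e_h, e_i) = rho
    Sigma_e = sigma**2 * ((1 - rho) * np.eye(m) + rho * np.ones((m, m)))
    e = rng.multivariate_normal(np.zeros(m), Sigma_e)
    return mu + u + v + e

rng = np.random.default_rng(20240328)
T = simulate_study(m=6, mu=0.3, tau=0.1, omega=0.1,
                   sigma=np.sqrt(0.05), rho=0.6, rng=rng)
```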

Define $A_i$ as an indicator that is equal to one if $T_i$ is statistically significant at level $\alpha$ based on a one-sided test, and otherwise equal to zero. (Equivalently, let $A_i$ be equal to one if the effect is statistically significant at level $2\alpha$ and in the theoretically expected direction.) Formally,
$$
A_i = I\left(\frac{T_i}{\sigma} > q_\alpha\right),
$$
where $q_\alpha = \Phi^{-1}(1 - \alpha)$ is the critical value from a standard normal distribution (e.g., $q_{.05} = 1.645$, $q_{.025} = 1.96$). Let $N_A = \sum_{i=1}^m A_i$ denote the total number of statistically significant effect sizes in the study. The question is: what is the distribution of $N_A$?

Compound symmetry to the rescue

As I noted in the previous post, this set-up means that the effect size estimates have a compound symmetric distribution. We can make this a bit more explicit by writing the sampling errors as the sum of a component that's common across outcomes and a component that's specific to each outcome. Thus, let $e_i = f + g_i$, where $f \sim N(0, \rho\sigma^2)$ and $g_i \stackrel{iid}{\sim} N\left(0, (1 - \rho)\sigma^2\right)$. Let me also define $\zeta = \mu + u + f$ as the conditional mean of the effects. It then follows that the effect size estimates are conditionally independent, given the common components:
$$
\left(T_i | \zeta\right) \stackrel{iid}{\sim} N\left(\zeta, \ \omega^2 + (1 - \rho)\sigma^2\right).
$$
Furthermore, the conditional probability of a significant effect is
$$
\Pr(A_i = 1 | \zeta) = \Phi\left(\frac{\zeta - q_\alpha \sigma}{\sqrt{\omega^2 + (1 - \rho)\sigma^2}}\right),
$$
and $A_1,...,A_m$ are mutually independent, conditional on $\zeta$. Therefore, the conditional distribution of $N_A$ is binomial,
$$
\left(N_A | \zeta\right) \sim \text{Bin}(m, \pi), \qquad \text{where} \qquad \pi = \Phi\left(\frac{\zeta - q_\alpha \sigma}{\sqrt{\omega^2 + (1 - \rho)\sigma^2}}\right).
$$
What about the unconditional distribution?
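The decomposition also suggests a simpler, equivalent way to simulate the model: draw the common component $\zeta$ once, then draw the effects as conditionally iid normals around it. A sketch, continuing from the code above (the function name is again hypothetical):

```python
from scipy.stats import norm

def simulate_NA(m, mu, tau, omega, sigma, rho, alpha, rng):
    """Simulate N_A via the decomposition: draw zeta once, then
    draw the effects as conditionally iid normals around it."""
    zeta = rng.normal(mu, np.sqrt(tau**2 + rho * sigma**2))
    T = rng.normal(zeta, np.sqrt(omega**2 + (1 - rho) * sigma**2), size=m)
    q_alpha = norm.ppf(1 - alpha)           # one-sided critical value
    return int(np.sum(T / sigma > q_alpha)) # count of significant effects
```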

To get rid of the $\zeta$, we need to integrate over its distribution, which leads to
$$
\Pr(N_A = a) = \text{E}\left[\Pr(N_A = a | \zeta)\right] = \int f_{N_A}(a | \zeta, \omega, \sigma, \rho, m) \times f_\zeta(\zeta | \mu, \tau, \sigma, \rho) \ d\zeta,
$$
where $f_{N_A}(a | \zeta, \omega, \sigma, \rho, m)$ is a binomial density with size $m$ and probability $\pi = \pi(\zeta, \omega, \sigma, \rho)$, and $f_\zeta(\zeta | \mu, \tau, \sigma, \rho)$ is a normal density with mean $\mu$ and variance $\tau^2 + \rho\sigma^2$.

This distribution is what you might call a binomial-normal convolution or a random-intercept probit model (where the random intercept is ζ). As far as I know, the distribution cannot be evaluated analytically but instead must be calculated using some sort of numerical integration routine.
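For concreteness, here is one way such a routine might look, using Gauss-Hermite quadrature to average the conditional binomial density over the normal distribution of $\zeta$ (a sketch with a hypothetical function name, assuming numpy and scipy; this is my rendering, not the code behind the interactive graph below):

```python
from scipy.stats import binom

def pmf_NA(a, m, mu, tau, omega, sigma, rho, alpha=0.025, n_quad=21):
    """Approximate Pr(N_A = a) by Gauss-Hermite quadrature over zeta."""
    q_alpha = norm.ppf(1 - alpha)
    sd_zeta = np.sqrt(tau**2 + rho * sigma**2)          # sd of zeta
    sd_cond = np.sqrt(omega**2 + (1 - rho) * sigma**2)  # conditional sd of T_i
    # Gauss-Hermite nodes/weights, rescaled to integrate against N(mu, sd_zeta^2)
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    zeta = mu + np.sqrt(2.0) * sd_zeta * x
    pi = norm.cdf((zeta - q_alpha * sigma) / sd_cond)   # pi(zeta) at each node
    return float(np.sum(w / np.sqrt(np.pi) * binom.pmf(a, m, pi)))
```

With the default parameter values from the interactive display below, `pmf_NA(0, m=6, mu=0.3, tau=0.1, omega=0.1, sigma=np.sqrt(0.05), rho=0.6)` should come out to roughly 0.363, the height of the first bar in the graph.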

Just the moments, please

If all we care about is the expectation of $N_A$, we don't need to bother with all the conditioning business and can just look at the marginal distribution of the effect size estimates taken individually. Marginally, $T_i$ is normally distributed with mean $\mu$ and variance $\tau^2 + \omega^2 + \sigma^2$, so $\Pr(A_i = 1) = \psi$, where
$$
\psi = \Phi\left(\frac{\mu - q_\alpha \sigma}{\sqrt{\tau^2 + \omega^2 + \sigma^2}}\right).
$$
By the linearity of expectations,
$$
\text{E}(N_A) = \sum_{i=1}^m \text{E}(A_i) = m\psi.
$$
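As a quick numerical check, take the default values from the interactive graph below ($m = 6$, $\mu = 0.3$, $\tau = \omega = 0.1$, $\sigma = \sqrt{0.05} \approx 0.224$, $\alpha = .025$):
$$
\psi = \Phi\left(\frac{0.3 - 1.96 \times 0.224}{\sqrt{0.01 + 0.01 + 0.05}}\right) = \Phi(-0.522) \approx 0.301, \qquad \text{E}(N_A) = 6 \times 0.301 \approx 1.80,
$$
which matches the reported moments.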

We can also get an approximation for the variance of $N_A$ by working with its conditional distribution above. By the rule of variance decomposition,
$$
\begin{aligned}
\text{Var}(N_A) &= \text{E}\left[\text{Var}(N_A | \zeta)\right] + \text{Var}\left[\text{E}(N_A | \zeta)\right] \\
&= m \times \text{E}\left[\pi(1 - \pi)\right] + m^2 \times \text{Var}[\pi] \\
&= m \times \text{E}[\pi]\left(1 - \text{E}[\pi]\right) + m(m - 1) \times \text{Var}[\pi],
\end{aligned}
$$
where $\pi$ is, as defined above, a function of $\zeta$ and thus a random variable. Now, $\text{E}(\pi) = \psi$, and we can get something close to $\text{Var}(\pi)$ using a first-order approximation:
$$
\text{Var}(\pi) \approx \left(\left.\frac{\partial \pi}{\partial \zeta}\right|_{\zeta = \mu}\right)^2 \times \text{Var}(\zeta) = \left[\phi\left(\frac{\mu - q_\alpha \sigma}{\sqrt{\omega^2 + (1 - \rho)\sigma^2}}\right)\right]^2 \times \frac{\tau^2 + \rho\sigma^2}{\omega^2 + (1 - \rho)\sigma^2}.
$$
Thus,
$$
\text{Var}(N_A) \approx m \times \psi(1 - \psi) + m(m - 1) \times \left[\phi\left(\frac{\mu - q_\alpha \sigma}{\sqrt{\omega^2 + (1 - \rho)\sigma^2}}\right)\right]^2 \times \frac{\tau^2 + \rho\sigma^2}{\omega^2 + (1 - \rho)\sigma^2}.
$$
If the amount of common variation is small, so that $\tau^2$ is near zero and $\rho$ is near zero, then the contribution of the second term will be small, and $N_A$ will act more or less like a binomial random variable with size $m$ and probability $\psi$. On the other hand, if the amount of independent variation in the effect sizes is small, so that $\omega^2$ is near zero and $\rho$ is near 1, then the term on the right will approach $m(m - 1)\psi(1 - \psi)$, and $\text{Var}(N_A)$ will approach $m^2\psi(1 - \psi)$, the variance of $m$ times a single Bernoulli variate. So you could say that $N_A$ has anywhere between one and $m$ variates' worth of information in it, depending on the degree of correlation between the effect size estimates.
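And here is a sketch of the corresponding moment calculations (again with a hypothetical function name, reusing the imports from the earlier snippets), computing the exact mean and variance by quadrature alongside the first-order approximation:

```python
def moments_NA(m, mu, tau, omega, sigma, rho, alpha=0.025, n_quad=21):
    """Return psi, the exact E(N_A) and Var(N_A) (via quadrature), and
    the first-order approximation to Var(N_A)."""
    q_alpha = norm.ppf(1 - alpha)
    sd_zeta = np.sqrt(tau**2 + rho * sigma**2)
    sd_cond = np.sqrt(omega**2 + (1 - rho) * sigma**2)
    psi = norm.cdf((mu - q_alpha * sigma) / np.sqrt(tau**2 + omega**2 + sigma**2))
    # first-order (delta-method) approximation to Var(pi)
    V_pi = (norm.pdf((mu - q_alpha * sigma) / sd_cond) * sd_zeta / sd_cond) ** 2
    V_approx = m * psi * (1 - psi) + m * (m - 1) * V_pi
    # exact first two moments of pi by Gauss-Hermite quadrature
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    pi = norm.cdf((mu + np.sqrt(2.0) * sd_zeta * x - q_alpha * sigma) / sd_cond)
    w = w / np.sqrt(np.pi)
    E_pi, E_pi2 = np.sum(w * pi), np.sum(w * pi**2)
    V_exact = m * (E_pi - E_pi2) + m**2 * (E_pi2 - E_pi**2)
    return psi, m * E_pi, V_exact, V_approx
```

At the default parameter values below, this should return values in line with the moments reported under the graph ($\psi \approx 0.301$, $\text{E}(N_A) \approx 1.804$, $\text{V}(N_A) \approx 3.604$, $V_{approx} \approx 4.628$).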

Interactive distribution

Here is an interactive graph of the probability mass function of $N_A$, with probability points calculated using Gaussian quadrature. Below the graph, I also report $\psi$, the exact mean and variance of $N_A$, and the first-order approximation to the variance (denoted $V_{approx}$). When $\tau > 0$ and $\rho > 0$, the approximate variance is not all that accurate because the first-order approximation to $\text{Var}(\pi)$ isn't that good.


Distribution of $N_A$

[Interactive graph: probability mass function of $N_A$, plotting probability (0 to 1) against the number of significant effect sizes (0 to 6).]

Moments of $N_A$

$$
\begin{aligned}
\psi &= 0.301 \\
\text{E}\left(N_A\right) &= 1.804 \\
\text{V}\left(N_A\right) &= 3.604 \\
V_{approx} &= 4.628
\end{aligned}
$$
Parameter settings: $m = 6$, $\text{ESS} = 80$ ($\sigma \approx 0.224$), $\mu = 0.3$, $\tau = 0.1$, $\omega = 0.1$, $\rho = 0.6$, $\alpha = .025$, with 21 quadrature points.